A Novel Multi-core System for Fault Tolerance and Security with Diskless Checkpointing Capability
Abstract
Improving fault tolerance in clusters has become vital because the Mean Time Between Failures (MTBF) of complex clusters has decreased drastically. Checkpointing is a robust way to improve fault tolerance: after a failure, the cluster rolls back to a saved state. Because checkpointing relies primarily on storage devices to store state at regular intervals, it is difficult to provide enough storage devices to support the ever-increasing number of components in a cluster. This performance bottleneck is overcome by an independent checkpointing method called diskless checkpointing. Since diskless checkpointing was launched commercially, several research efforts have led to its division into three categories: neighbor-based, parity-based, and Reed-Solomon code-based. All of these algorithms have been applied to recovering clusters from multiple processor failures. Meanwhile, because of space and power constraints, multicore architectures have come into wide use in clusters. Most upcoming supercomputing clusters are expected to consist largely of multicore processors, or Chip MultiProcessors (CMPs). This shows the need for a robust diskless checkpointing algorithm for recovering from failures in multicore architectures. This paper therefore investigates the existing diskless checkpointing algorithms for multiple processor failures and devises a robust system for CMPs.
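To make the parity-based category concrete, the following is a minimal sketch of the general idea, not the system proposed in the paper. It assumes equal-sized in-memory checkpoints and a single failed core; the function names and the sample states are hypothetical. Each core keeps its checkpoint in memory, a spare node holds the bytewise XOR of all checkpoints, and a lost checkpoint is rebuilt from the survivors plus the parity.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(checkpoints: list[bytes]) -> bytes:
    """Parity checkpoint, as it might be held in a spare node's memory."""
    return reduce(xor_bytes, checkpoints)


def recover(survivors: list[bytes], parity: bytes) -> bytes:
    """Rebuild the one lost checkpoint from the survivors and the parity."""
    return reduce(xor_bytes, survivors, parity)


# Hypothetical in-memory checkpoints of three cores (equal length assumed).
states = [b"core0-st", b"core1-st", b"core2-st"]
parity = make_parity(states)

# Simulate losing core 1 and rebuilding its checkpoint.
lost = 1
survivors = states[:lost] + states[lost + 1:]
recovered = recover(survivors, parity)
```

A single XOR parity tolerates only one lost checkpoint per group; recovering from multiple simultaneous failures is where schemes such as Reed-Solomon coding come in.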
Related resources
Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid
Grid computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling are important issues for large-scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique for realizing fault tolerance is periodic checkpointing, which periodically ...
A Diskless Checkpointing Algorithm for Super-scale Architectures Applied to the Fast Fourier Transform
This paper discusses the issue of fault-tolerance in distributed computer systems with tens or hundreds of thousands of diskless processor units. Such systems, like the IBM BlueGene/L, are predicted to be deployed in the next five to ten years. Since a 100,000-processor system is going to be less reliable, scientific applications need to be able to recover from occurring failures more efficient...
Fault Tolerance and Checkpointing
Research on and applications of clusters of workstations are growing rapidly. One of the major areas is fault tolerance. This report describes two of the issues involved: correctness and performance. After a number of techniques for improving performance are described, new research directions, diskless checkpointing and Java checkpointing, are introduced.
Adaptive Checkpointing
Checkpointing is a typical approach to tolerate failures in today’s supercomputing clusters and computational grids. Checkpoint data can be saved either in central stable storage, or in processor memory (as in diskless checkpointing), or local disk space (replacing memory with local disk in diskless checkpointing). But where to save the checkpoint data has a great impact on the performance of a...
Fault Tolerant Matrix Operations for Networks of Workstations Using Multiple Checkpointing
Recently, an algorithm-based approach using diskless checkpointing has been developed to provide fault tolerance for high-performance matrix operations. With this approach, since fault tolerance is incorporated into the matrix operations, the matrix operations become resilient to any single processor failure or change with low overhead. In this paper, we present a technique called multiple chec...
Publication date: 2012